Scene Feature


BlockGAN: Learning 3D Object-aware Scene Representations from Unlabelled Images

Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong-Liang Yang, Niloy Mitra

Neural Information Processing Systems

The computer graphics pipeline has achieved impressive results in generating high-quality images, while offering users a great level of freedom and controllability over the generated images. This has many applications in creating and editing content for the creative industries, such as films, games, scientific visualisation, and more recently, in generating training data for computer vision tasks.


ReasonGen-R1: CoT for Autoregressive Image Generation Models through SFT and RL

Zhang, Yu, Li, Yunqi, Yang, Yifan, Wang, Rui, Yang, Yuqing, Dai, Qi, Bao, Jianmin, Chen, Dongdong, Luo, Chong, Qiu, Lili

arXiv.org Artificial Intelligence

Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization (GRPO). To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision-language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.
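
To make the RL stage concrete, here is a minimal PyTorch sketch of a GRPO-style update, assuming a group of images sampled for the same prompt and one scalar reward per image from the vision-language judge. The function names and the clipped-surrogate details are illustrative assumptions, not the authors' exact implementation.

    import torch

    def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        """Group-relative advantages: normalize the rewards of all images
        sampled for the same prompt to zero mean and unit variance."""
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    def grpo_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
        """PPO-style clipped surrogate objective averaged over the group."""
        ratio = torch.exp(logprobs - old_logprobs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()

Because advantages are computed relative to the group mean, no learned value function is needed; the reward model alone drives each policy update.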


Where Do You Go? Pedestrian Trajectory Prediction using Scene Features

Rezaei, Mohammad Ali, Ayar, Fardin, Javanmardi, Ehsan, Tsukada, Manabu, Javanmardi, Mahdi

arXiv.org Artificial Intelligence

Accurate prediction of pedestrian trajectories is crucial for enhancing the safety of autonomous vehicles and reducing traffic fatalities involving pedestrians. While numerous studies have focused on modeling interactions among pedestrians to forecast their movements, the influence of environmental factors and scene-object placements has been comparatively underexplored. In this paper, we present a novel trajectory prediction model that integrates both pedestrian interactions and environmental context to improve prediction accuracy. Our approach captures spatial and temporal interactions among pedestrians within a sparse graph framework. To account for pedestrian-scene interactions, we employ advanced image enhancement and semantic segmentation techniques to extract detailed scene features. These scene and interaction features are then fused through a cross-attention mechanism, enabling the model to prioritize relevant environmental factors that influence pedestrian movements. Finally, a temporal convolutional network processes the fused features to predict future pedestrian trajectories. Experimental results demonstrate that our method significantly outperforms existing state-of-the-art approaches, achieving ADE and FDE values of 0.252 and 0.372 meters, respectively, underscoring the importance of incorporating both social interactions and environmental context in pedestrian trajectory prediction.
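
As a rough sketch of the cross-attention fusion step described above, pedestrian interaction features can act as queries over segmentation-derived scene features. The dimensions and module below are illustrative assumptions, not the authors' exact architecture.

    import torch
    import torch.nn as nn

    class SceneCrossAttention(nn.Module):
        """Fuse pedestrian interaction features with scene features.
        Pedestrians query the scene, so the attention weights pick out the
        environmental regions most relevant to each pedestrian."""
        def __init__(self, dim: int = 64, heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, ped_feats: torch.Tensor, scene_feats: torch.Tensor):
            # ped_feats:   (batch, num_pedestrians, dim) from the sparse graph
            # scene_feats: (batch, num_patches, dim) from the segmentation maps
            fused, _ = self.attn(ped_feats, scene_feats, scene_feats)
            return ped_feats + fused  # residual keeps the social-interaction cues

The fused sequence would then feed the temporal convolutional network that decodes future positions.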


CAST: Cross-modal Alignment Similarity Test for Vision Language Models

Dagan, Gautier, Loginova, Olga, Batra, Anil

arXiv.org Artificial Intelligence

Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks which assess a model's understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by a misalignment between modalities. To address this, we propose a Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. This test involves asking the models to identify similarities between two scenes through text-only, image-only, or both and then assess the truthfulness of the similarities they generate. Since there is no ground-truth to compare against, this evaluation does not focus on objective accuracy but rather on whether VLMs are internally consistent in their outputs. We argue that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.
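
The test procedure can be summarized in pseudocode. `vlm_generate` and `vlm_judge` below are assumed interfaces standing in for whatever model wrapper is used, not a real API.

    def cast_self_consistency(vlm_generate, vlm_judge, scene_a, scene_b):
        """Probe a VLM for cross-modal self-consistency on a scene pair."""
        conditions = {
            "text_only":  (scene_a.caption, scene_b.caption),
            "image_only": (scene_a.image, scene_b.image),
            "multimodal": ((scene_a.caption, scene_a.image),
                           (scene_b.caption, scene_b.image)),
        }
        # Step 1: elicit similarities under each input modality.
        similarities = {
            mode: vlm_generate("List similarities between the two scenes.", inputs)
            for mode, inputs in conditions.items()
        }
        # Step 2: the same model judges the truthfulness of each claim, so the
        # score measures internal consistency, not objective accuracy.
        verdicts = {mode: vlm_judge(claims, scene_a, scene_b)
                    for mode, claims in similarities.items()}
        return similarities, verdicts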


MCDS-VSS: Moving Camera Dynamic Scene Video Semantic Segmentation by Filtering with Self-Supervised Geometry and Motion

Villar-Corrales, Angel, Austermann, Moritz, Behnke, Sven

arXiv.org Artificial Intelligence

Autonomous systems, such as self-driving cars, rely on reliable semantic environment perception for decision making. Despite great advances in video semantic segmentation, existing approaches ignore important inductive biases and lack structured and interpretable internal representations. In this work, we propose MCDS-VSS, a structured filter model that learns in a self-supervised manner to estimate the scene geometry and ego-motion of the camera, while also estimating the motion of external objects. Our model leverages these representations to improve the temporal consistency of semantic segmentation without sacrificing segmentation accuracy. MCDS-VSS follows a prediction-fusion approach in which scene geometry and camera motion are first used to compensate for ego-motion, then residual flow is used to compensate for the motion of dynamic objects, and finally the predicted scene features are fused with the current features to obtain a temporally consistent scene segmentation. Our model parses automotive scenes into multiple decoupled, interpretable representations such as scene geometry, ego-motion, and object motion. Quantitative evaluation shows that MCDS-VSS achieves superior temporal consistency on video sequences while retaining competitive segmentation performance.
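
A minimal sketch of the prediction-fusion step, assuming feature maps and normalized sampling grids in PyTorch; the shapes and the `fuse_gate` module are placeholders for the paper's learned components, not its actual interfaces.

    import torch
    import torch.nn.functional as F

    def predict_and_fuse(prev_feats, cur_feats, ego_grid, residual_grid, fuse_gate):
        # prev_feats, cur_feats: (B, C, H, W) scene features
        # ego_grid, residual_grid: (B, H, W, 2) normalized sampling grids derived
        # from estimated geometry/ego-motion and from residual flow, respectively
        ego_compensated = F.grid_sample(prev_feats, ego_grid, align_corners=False)
        predicted = F.grid_sample(ego_compensated, residual_grid, align_corners=False)
        # A learned gate decides, per location, how much to trust the prediction
        gate = torch.sigmoid(fuse_gate(torch.cat([predicted, cur_feats], dim=1)))
        return gate * predicted + (1.0 - gate) * cur_feats

Warping in two stages mirrors the decomposition in the abstract: ego-motion explains most of the apparent motion, and residual flow accounts only for dynamic objects.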


Measuring and Modeling Physical Intrinsic Motivation

Martinez, Julio, Binder, Felix, Wang, Haoliang, Haber, Nick, Fan, Judith, Yamins, Daniel L. K.

arXiv.org Artificial Intelligence

Humans are interactive agents driven to seek out situations with interesting physical dynamics. Here we formalize the functional form of physical intrinsic motivation. We first collect ratings of how interesting humans find a variety of physics scenarios. We then model human interestingness responses by implementing various hypotheses of intrinsic motivation, ranging from models that rely on simple scene features to models that depend on forward physics prediction. We find that the single best predictor of human responses is adversarial reward, a model derived from physical prediction loss. We also find that simple scene-feature models do not generalize their prediction of human responses across all scenarios. Finally, linearly combining the adversarial model with the number of collisions in a scene leads to the greatest improvement in predictivity of human responses, suggesting humans are driven towards scenarios that result in high information gain and physical activity.
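
The best-performing model amounts to a two-feature linear regression; a toy sketch (with hypothetical input arrays) fits it by least squares.

    import numpy as np

    def fit_interestingness(adv_reward, n_collisions, human_ratings):
        """Fit ratings ~ w0 * adversarial_reward + w1 * collisions + bias."""
        X = np.column_stack([adv_reward, n_collisions, np.ones_like(adv_reward)])
        weights, *_ = np.linalg.lstsq(X, human_ratings, rcond=None)
        return weights  # predict new ratings with X @ weights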